Add new pre-trained models BERTweet and PhoBERT #6129
Conversation
Re-add `bart` to LM_MAPPING
Re-add `from .configuration_mobilebert import MobileBertConfig` not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`
Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer).
Could I get any support from Hugging Face w.r.t. this pull request, @julien-c? Thanks. |
Hello @datquocnguyen ! As you've said, BERTweet and PhoBERT reimplement the RoBERTa model without adding any special behavior. I don't think it's necessary to reimplement them then, is it? Uploading them on the hub should be enough to load them into RoBERTa architectures, right? |
Hi @LysandreJik |
I hope both BERTweet and PhoBERT could be incorporated into `transformers`. |
Yes, I understand, that makes sense. There shouldn't be any issue in incorporating them into `transformers`. |
I've taken a quick look at it, and it looks very cool! Something we can maybe do better is the tokenizers:
Let me know what you think! |
Haven't tried it directly, but as seen with @n1t0, since you're not doing any fancy pre-processing it might be as simple as the following:

class PhobertTokenizerFast(PreTrainedTokenizerFast):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        kwargs.setdefault("unk_token", unk_token)
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file,
                merges_file=merges_file,
                unk_token=unk_token,
                lowercase=False,
                bert_normalizer=False,
                split_on_whitespace_only=True,
            ),
            **kwargs,
        ) |
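As a side note, here is a toy sketch of what character-level BPE (the scheme CharBPETokenizer implements) does to a word. The merges table and the input word are made up for illustration; this is not the transformers or tokenizers API, just the core idea: start from characters and greedily apply learned merges.

```python
# Made-up merge table, highest priority first (illustrative only).
merges = [("h", "e"), ("he", "l"), ("hel", "l"), ("hell", "o")]

def bpe(word):
    """Apply character-level BPE merges to a single word."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:  # apply merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe("hello"))  # → ['hello']
print(bpe("help"))   # → ['hel', 'p']
```

A word covered by the merge table collapses to one token, while an unseen suffix stays as smaller pieces, which is exactly the behavior the fast tokenizer wrapper above delegates to CharBPETokenizer.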
Thanks very much @LysandreJik I will revise the code following your comments and inform you as soon as I complete it. |
@datquocnguyen Yeah, these models are cool. Lovin' it. I think we can try to figure out how to convert |
Yes. Thanks @JetRunner |
Some tokenizer functions (decode, convert_ids_to_tokens) haven't been implemented for PhobertTokenizer, right? |
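For context, a hypothetical sketch of the decode direction: once a slow tokenizer provides a per-id hook, the list-level helpers can be built on top of it. The vocabulary below is made up, and the names are illustrative rather than the actual transformers API; real BPE decoding also has to strip merge markers such as `@@`.

```python
# Made-up id-to-token table (illustrative only).
id_to_token = {0: "<s>", 1: "</s>", 2: "xin", 3: "chào"}

def _convert_id_to_token(index):
    # The per-id hook a slow tokenizer subclass would implement.
    return id_to_token[index]

def convert_ids_to_tokens(ids):
    # List-level helper built on the per-id hook.
    return [_convert_id_to_token(i) for i in ids]

def decode(ids):
    # Naive whitespace join; real BPE decoding also strips "@@" markers.
    return " ".join(convert_ids_to_tokens(ids))

print(decode([2, 3]))  # → xin chào
```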
@datquocnguyen Thank you for this pull request. I tried the Bertweet model and met a problem: the tokenizer encoded special symbols like "<pad>" not as a whole token. Instead, it would split the string into characters like "< p a d >". I fixed the problem by modifying the code at `` as below:

--- a/BERTweet/transformers/tokenization_bertweet.py
+++ b/BERTweet/transformers/tokenization_bertweet.py
@@ -242,9 +242,14 @@ class BertweetTokenizer(PreTrainedTokenizer):
text = self.normalizeTweet(text)
return self.bpe.apply([text])[0].split()
- def convert_tokens_to_ids(self, tokens):
- """ Converts a list of str tokens into a list of ids using the vocab."""
- return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+ def _convert_token_to_id(self, token):
+ #""" Converts a list of str tokens into a list of ids using the vocab."""
+ #return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+ return self.vocab.encode_line(token, append_eos=False, add_if_not_exist=False).long().tolist()[0]
+
+ @property
+ def vocab_size(self) -> int:
+ return len(self.vocab)

From my understanding, to encode a sentence, the order of the interfaces called in this case is |
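To illustrate why moving from `convert_tokens_to_ids` to the per-token `_convert_token_to_id` hook fixes the "<pad>" splitting: the base class converts token-by-token, so a special token that was kept whole upstream is looked up as one string rather than re-encoded as raw text. A minimal, self-contained model of that dispatch (the `TinyVocab` class is invented for illustration and is not a transformers API):

```python
class TinyVocab:
    """Made-up vocabulary with one-token-to-one-id lookup."""
    def __init__(self, tokens):
        self.index = {t: i for i, t in enumerate(tokens)}

    def lookup(self, token):
        # One token -> one id; the string is never re-split into characters.
        return self.index.get(token, self.index["<unk>"])

vocab = TinyVocab(["<unk>", "<pad>", "hello", "world"])

def convert_tokens_to_ids(tokens):
    # Mirrors the base-class pattern: iterate and convert per token,
    # which is why overriding the per-token hook keeps "<pad>" intact.
    return [vocab.lookup(t) for t in tokens]

print(convert_tokens_to_ids(["hello", "<pad>", "world"]))  # → [2, 1, 3]
```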
I will have a look soon. Thanks @Miopas. |
I have just tried "BertweetTokenizer" and got this error: "ImportError: cannot import name 'BertweetTokenizer' from 'transformers' (/home/apps/anaconda3/lib/python3.7/site-packages/transformers/__init__.py)". I have also tried: tokenizer2 = BertTokenizer.from_pretrained("vinai/bertweet-base") and got: Is there any solution to it? Thanks! |
Codecov Report
@@ Coverage Diff @@
## master #6129 +/- ##
==========================================
- Coverage 80.32% 80.08% -0.25%
==========================================
Files 168 170 +2
Lines 32285 32642 +357
==========================================
+ Hits 25932 26140 +208
- Misses 6353 6502 +149
Continue to review full report at Codecov.
|
@datquocnguyen Can you also upload your model files to https://huggingface.co/vinai/bertweet-base? I still get this error:
|
@datquocnguyen I looked at the PR and am looking forward to this merge. I have a few suggestions:
|
Hi @napsternxg, the model has already been uploaded to https://huggingface.co/vinai/bertweet-base. For now, you would have to install
Thanks for your suggestions. BertweetTokenizer is specifically designed to work on Tweet data, incorporating a TwitterTokenizer, while PhobertTokenizer does not. Note that both our |
Btw, I should mention that BERTweet has been accepted as an EMNLP-2020 demo paper, while PhoBERT has a slot in the Findings of EMNLP-2020 volume. Please help review this pull request so that others might benefit from using them directly from the master branch of |
Thanks that makes sense. |
@napsternxg Please remove your "transformers" cache folder and then run:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples |
@datquocnguyen Great work, and I am looking forward to seeing the PR get merged so that I can use the models directly from the Hugging Face transformers library. |
Ok I think this is great, I have nothing to add. LGTM, thanks for adding tests!
Will merge today unless @julien-c, @JetRunner have comments. |
LGTM, do not hesitate to make the tokenizers as generic/configurable as possible, but this can be in a subsequent PR |
* Add BERTweet and PhoBERT models
* Update modeling_auto.py Re-add `bart` to LM_MAPPING
* Update tokenization_auto.py Re-add `from .configuration_mobilebert import MobileBertConfig` not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`
* Add BERTweet and PhoBERT to pretrained_models.rst
* Update tokenization_auto.py Remove BertweetTokenizer and PhobertTokenizer out of tokenization_auto.py (they are currently not supported by AutoTokenizer.
* Update BertweetTokenizer - without nltk
* Update model card for BERTweet
* PhoBERT - with Auto mode - without import fastBPE
* PhoBERT - with Auto mode - without import fastBPE
* BERTweet - with Auto mode - without import fastBPE
* Add PhoBERT and BERTweet to TF modeling auto
* Improve Docstrings for PhobertTokenizer and BertweetTokenizer
* Update PhoBERT and BERTweet model cards
* Fixed a merge conflict in tokenization_auto
* Used black to reformat BERTweet- and PhoBERT-related files
* Used isort to reformat BERTweet- and PhoBERT-related files
* Reformatted BERTweet- and PhoBERT-related files based on flake8
* Updated test files
* Updated test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Update commits from huggingface
* Delete unnecessary files
* Add tokenizers to auto and init files
* Add test files for tokenizers
* Revised model cards
* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files
* Revised test files
* Update orders of Phobert and Bertweet tokenizers in auto tokenization file
Any news on it? When will PhoBERT be available on Hugging Face? |
It's been available since September:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

You can see the model card here. |
But I don't see it here: https://huggingface.co/transformers/pretrained_models.html |
PhoBERT is based off of the RoBERTa implementation, so you can load it into a RoBERTa architecture. I have never used Rasa NLU, so I can't help you much here. Your best option would be to open a thread on our forum with an example of how you do things for other models, so as not to flood this PR. You can ping me on the thread (@Lysandre). |
I'd like to add pre-trained BERTweet and PhoBERT models to the `transformers` library. Users can now use these models directly from `transformers`. E.g.:

BERTweet: A pre-trained language model for English Tweets
PhoBERT: Pre-trained language models for Vietnamese